(57) An image tracking system 2 comprises a video
camera 3 or other device for presenting a sequence of 
images to a processor 4. The images are presented in real 
time in the form of a sequence of video fields. The 
processor 4 determines (79, fig 18) the position of a target 
image in a preceding image frame, for instance using a 
template matching technique limited to the region of the 
image field where the target image may be located. The 
processor 4 then determines (76, fig 18) the amount of 
movement between the preceding field and the next field, 
for instance by using a differential technique. The updated 
position of the target image is then formed (77, fig 18) as 
the sum of the position of the target image in the 
preceding field and the amount of movement between the 
preceding field and the next consecutive field. The position 
data are supplied (78, fig 18), for instance, to a window 
steering mechanism 6 of an autostereoscopic 3D display 7 
so that the eyes of an observer 8 can be tracked to ensure 
autostereoscopic viewing of the display 7 with an enlarged 
degree of freedom of movement of the observer. 
At least one drawing originally filed was informal and the print reproduced here is taken from a later filed formal copy.
Sharp K.K.

IMAGE TRACKING SYSTEM AND METHOD AND OBSERVER 
TRACKING AUTOSTEREOSCOPIC DISPLAY. 

The present invention relates to an image tracking system and method. 
The present invention also relates to an observer tracking 
autostereoscopic display in which such a system and method may be 
used. The system and method may also be used in other applications, 
such as security surveillance, video and image compression, video 
conferencing, computer games, driver monitoring, graphical user 
interfaces, camera auto-focus systems and multimedia. 



Autostereoscopic displays are well known and examples are disclosed in 
EP 0 602 934, EP 0 656 555, EP 0 708 351, EP 0 726 482 and GB 
9619097.0. Figure 1 of the accompanying drawings illustrates 
schematically the basic components of a typical autostereoscopic display. 
The display comprises a display system 1 and a tracking system 2. The 
tracking system 2 comprises a tracking sensor 3 which supplies a sensor 
signal to a tracking processor 4. The tracking processor derives from the 
sensor signal an observer position data signal which is supplied to a 
display control processor 5 of the display system 1. The processor 5 
converts the position data signal into a window steering signal and 
supplies this to a steering mechanism 6 which cooperates with a display 
7 such that an observer 8 can view the display autostereoscopically 
throughout an extended range of observer positions. 

Figure 2 of the accompanying drawings illustrates, purely by way of 
example, part of a display system 1 including the display 7 and the 
[...] and 22 illustrate the theoretical longitudinal viewing freedom for the
display 7. 

In order to extend the viewing freedom of the observer, as described 
hereinbefore, observer tracking and control of the display may be 
provided. The positions of the viewing windows 19 and 20 are "steered" 
to follow movement of the head of the observer so that the eyes of the 
observer remain within the appropriate viewing zones. An essential part 
of such a display is the tracking system 2 which locates the position of 
the head and/or eyes of the observer. In effect, it is generally only 
necessary to track the centre point between the eyes of the observer 
because this is the position where the left and right viewing windows 
meet as shown in the left part of Figure 4b. Even for relatively large 
head rotations as shown in the right part of Figure 4b, such a system 
accurately positions the viewing windows 19 and 20 so as to maintain
autostereoscopic viewing. 

Each viewing window has a useful viewing region which is characterised 
by an illumination profile in the plane 23 as illustrated in Figure 5 of the 
accompanying drawings. The horizontal axis represents position in the
plane 23 whereas the vertical axis represents illumination intensity. The 
ideal illumination profile would be rectangular with the adjacent window 
profiles exactly contiguous. However, in practice, this is not achieved. 

As shown in Figure 5, the width of the window is taken to be the width 
of the illumination profile at half the maximum average intensity. The 
profiles of the adjacent viewing windows are not exactly contiguous but 
have an underlap (as shown) or an overlap. There is variation in 
uniformity for the "top" of the profile, which represents the useful width. 
[...] 2 illuminates the next strip 26 in the direction of movement and
extinguishes the opposite or trailing strip. 

In order to match the position data obtained by the tracking system 2 to 
the display window positions, a calibration process is required, for 
instance as disclosed in EP 0 769 881. A typical display 7 provides
viewing zones in the shape of cones or wedges, such as 28 as shown in 
Figure 7 of the accompanying drawings, which emanate from a common 
origin point referred to as the optical centre 29 of the display. The 
viewing zones determine the positions at which switching must take 
place whenever the centre of the two eyes of the observer moves from 
one window position to another. In this case, the viewing zones are 
angularly spaced in the horizontal plane specified by the lateral direction 
(X) and the longitudinal direction (Z) of the observer with respect to the 
display. 

An ideal tracking and display system would respond to any head 
movement instantaneously. In practice, any practical tracking and 
display system always requires a finite time, referred to as the system 
response time, to detect and respond to head movement. When there is 
only a finite number of steps for moving the viewing windows, an instant 
response may not be necessary. The performance requirements of the 
tracking system are then related to the distance an observer can move his 
eyes before the position of the viewing windows needs to be updated. 

For the autostereoscopic display illustrated in Figure 2 producing the 
window steps illustrated in Figure 6, the observer can move by a 
distance d equivalent to one step before the system needs to respond and 
update the window position. The distance d and the maximum speed v 
[...]

T = (d-e)/v



where e is the measuring error. The broken line 31 in Figure 8 illustrates 
the response time where e is 5 millimetres. Thus, the required response 
time is reduced to 22 milliseconds for a maximum head speed of 500 
millimetres per second. 

It is desirable to reduce the measuring error e but this cannot in practice 
be reduced to zero and there is a limit to how small the error can be 
made because of a number of factors including image resolution and the 
algorithms used in the tracking. In general, it is difficult to determine the 
measuring error until the algorithm for measuring the position data is 
implemented. For this reason, the above equation may be rewritten as: 



v = (d-e)/T



This gives the maximum head speed at which an observer can see a 
continuous 3D image for a given measuring error and a given response 
time. The smaller the measuring error and the shorter the response time, 
the faster an observer can move his head. The step size, the measuring 
error and the system response time should therefore be such as to 
provide a value of v which meets the desired criterion, for instance of 
500 millimetres per second. 
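
Purely by way of illustration, the relation T = (d-e)/v and its rearrangement for v may be evaluated numerically as in the following Python sketch. The step size d of 16 millimetres is an assumption inferred from the figures quoted above (22 milliseconds at 500 millimetres per second with a 5 millimetre error) and is not stated explicitly; the function names are hypothetical.

```python
def response_time_s(step_mm, error_mm, speed_mm_per_s):
    """Required system response time T = (d - e) / v, in seconds."""
    return (step_mm - error_mm) / speed_mm_per_s


def max_head_speed_mm_per_s(step_mm, error_mm, response_s):
    """Maximum head speed v = (d - e) / T, in millimetres per second."""
    return (step_mm - error_mm) / response_s


# Assumed step size d = 16 mm (inferred from the quoted figures, not stated).
D_MM = 16.0

t = response_time_s(D_MM, error_mm=5.0, speed_mm_per_s=500.0)
v = max_head_speed_mm_per_s(D_MM, error_mm=5.0, response_s=t)
print(f"T = {t * 1000:.0f} ms, v = {v:.0f} mm/s")
```

A smaller measuring error or a shorter response time increases the value returned by the second function, in line with the discussion above.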

A known type of infrared tracking system based on detecting infrared 
radiation reflected from a retroreflective spot worn by an observer
between his eyes is called the DynaSight sensor and is available from 
Origin Instruments. The 3D coordinates of the retroreflective spot with 
respect to an infrared sensor are obtained at a rate of up to 64 Hz. This 
[...]

Another technique is disclosed in the following papers:

T. S. Jebara and A. Pentland, "Parametrized Structure from Motion for 3D
Adaptive Feedback Tracking of Faces", MIT Media Laboratories,
Perceptual Computing Technical Report 401, submitted to CVPR
November 1996; A. Azarbayejani et al., "Real-Time 3D Tracking of the
Human Body", MIT Laboratories Perceptual Computing Section Technical
Report No. 374, Proc IMAGE'COM 1996, Bordeaux, France, May 1996;
N. Oliver and A. Pentland, "LAFTER: Lips and Face Real Time Tracker",
MIT Media Laboratory Perceptual Computing Section Technical Report
No. 396, submitted to Computer Vision and Pattern Recognition
Conference, CVPR'96; and A. Pentland, "Smart Rooms", Scientific
American, Volume 274, No. 4, pages 68 to 76, April 1996. However,
these techniques rely on the use of a number of sophisticated algorithms 
which are impractical for commercial implementation. Further, certain 
lighting control is necessary to ensure reliability. 

Another video camera based technique is disclosed in A. Suwa et al., "A
video quality improvement technique for videophone and
videoconference terminal", IEEE Workshop on Visual Signal Processing
and Communications, 21-22 September, 1993, Melbourne, Australia.
This technique provides a video compression enhancement system using
a skin colour algorithm and approximately tracks head position for 
improved compression ratios in videophone applications. However, the 
tracking precision is not sufficient for many applications. 

Most conventional video cameras have an analogue output which has to 
be converted to digital data for computer processing. Commercially
available and commercially attractive video cameras use an interlaced 
[...]

The time required to capture a field of an image is 20 milliseconds for a
PAL camera operating at 50 fields per second and 16.7 milliseconds for 
an NTSC camera operating at 60 fields per second. As described 
hereinbefore and illustrated in Figure 8, for a typical autostereoscopic 
display, the tracking system 2, the display control processor 5 and the 
steering mechanism 6 shown in Figure 1 have only about 22 
milliseconds to detect and respond to head movement for a maximum 
head speed of 500 millimetres per second and a measuring error of 5 
millimetres. If a PAL camera is used, the time left for processing the 
image and for covering other latencies due to communication and 
window steering is about 2 milliseconds. This time is increased to about 
5.3 milliseconds if an NTSC camera is used. Thus, the available time 
limits the processing techniques which can be used if standard 
commercially attractive hardware is to be used. If the actual time taken 
exceeds this time limit, the observer may have to restrict his head 
movement speed in order to see a flicker-free stereo image. 
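
By way of a worked example, the time remaining after field capture can be checked with a few lines of Python. The sketch below simply subtracts the field period from the 22 millisecond response time quoted above; the function name is hypothetical and the calculation ignores any overlap between capture and processing.

```python
def processing_budget_ms(response_ms, field_rate_hz):
    """Time left after capturing one field, in milliseconds."""
    field_ms = 1000.0 / field_rate_hz  # 20 ms for PAL, about 16.7 ms for NTSC
    return response_ms - field_ms


for name, rate_hz in (("PAL", 50.0), ("NTSC", 60.0)):
    budget = processing_budget_ms(22.0, rate_hz)
    print(f"{name}: {budget:.1f} ms left for processing and other latencies")
```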

Although the time required for digitising a video field may be reduced if 
a non-standard high speed camera is used, this is undesirable because of 
the substantially increased costs. Even if a high speed camera is used, 
there may be a limit to how fast it can be operated. It is very desirable 
to avoid the need for special light sources, whether visible or infrared, in 
order to achieve cost savings and improved ease of use. Thus, the 
tracking system 2 should be able to work with ordinary light sources 
whose intensities may oscillate at 100 or 120 Hz using the normal power
supply, i.e. twice the power supply frequency of 50 Hz, for instance in
the UK, or 60 Hz, for instance in the USA. If a camera is operating at a
speed close to or above this frequency, images taken at different times 
may differ significantly in intensity. Overcoming this effect requires extra 
computing complexity which offsets advantages of using high speed
cameras.



There is a practical limit to the computing power available in terms of
cost for any potential commercial implementation. Thus, a low
resolution camera is preferable so that the volume of image data is as
small as possible. However, a video camera would have to cover a field
of view at least as large as the viewing region of an autostereoscopic
display, so that the head of the observer would occupy only a small
portion of the image. The resolution of the interesting image regions
such as the eyes would therefore be very low. Also, the use of field rate
halves the resolution in the vertical direction.

There are many known techniques for locating the presence of an object
or "target image" within an image. Many of these techniques are
complicated and require excessive computing power and/or high
resolution images in order to extract useful features. Such techniques are
therefore impractical for many commercial applications.

A known image tracking technique is disclosed by R. Brunelli and T.
Poggio, "Face Recognition: Features Versus Templates", IEEE Trans on
Pattern Analysis and Machine Intelligence, Volume 15, No. 10, October
1993. This technique is illustrated in Figure 12 of the accompanying
drawings. In a first step 45, a "template" which contains a copy of the
target image to be located is captured. Figure 13 illustrates an image to
be searched at 46 and a template 47 containing the target image. After
the template has been captured, it is used to interrogate all subsections of
each image field in turn. Thus, at step 48, the latest digitised image is
acquired and, at step 49, template matching is performed by finding the
position at which there is a best correlation between the template and
the "underlying" image area. In particular, a subsection of the image 46 
having the same size and shape as the template is selected from the top 
left corner of the image and is correlated with the template 47. The 
correlation is stored and the process repeated by selecting another 
subsection one column of pixels to the right. This is repeated for the top 
row of the image and the process is then repeated by moving down one 
row of pixels. Thus, for an image having M by N pixels and a template 
having m by n pixels, there are (M-m+1) by (N-n+1) positions as 
illustrated in Figure 14 of the accompanying drawings. The cross- 
correlation values for these positions form a two dimensional function of 
these positions and may be plotted as a surface as shown in Figure 15 of 
the accompanying drawings. The peak of the surface indicates the best 
matched position. 
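
The exhaustive search described above may be sketched as follows, purely by way of illustration. The Python code assumes NumPy and a simple normalised correlation score; the function name exhaustive_match and the choice of score are assumptions and are not taken from the cited reference. It visits every one of the (M-m+1) by (N-n+1) positions and returns the peak.

```python
import numpy as np


def exhaustive_match(image, template):
    """Correlate the template at every position of the image.

    Returns (best_score, row, col, number_of_positions).
    """
    rows, cols = image.shape
    m, n = template.shape
    tp = template.astype(float)
    tp_norm = np.sqrt((tp * tp).sum())
    # One correlation value per candidate top-left corner.
    scores = np.zeros((rows - m + 1, cols - n + 1))
    for r in range(scores.shape[0]):
        for c in range(scores.shape[1]):
            patch = image[r:r + m, c:c + n].astype(float)
            denom = np.sqrt((patch * patch).sum()) * tp_norm
            scores[r, c] = (patch * tp).sum() / denom if denom > 0 else 0.0
    # The peak of the correlation surface is the best matched position.
    r, c = np.unravel_index(np.argmax(scores), scores.shape)
    return float(scores[r, c]), int(r), int(c), scores.size
```

The nested loops make the cost grow with the product of the image area and the template area, which is why direct matching over a whole field is so demanding.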

A step 50 determines whether the peak or best correlation value is 
greater than a predetermined threshold. If so, it may be assumed that the 
target image has been found in the latest digitised image and this 
information may be used, for instance as suggested at 51, to control an 
observer tracking autostereoscopic display. When the next digitised 
image has been captured, the steps 48 to 51 are repeated, and so on. 

Although template matching is relatively easy for computer 
implementation, it is a computing-intensive operation. Direct template 
matching requires very powerful computer hardware which is impractical 
for commercial implementation. 

According to a first aspect of the invention, there is provided an image 
tracking system comprising first means for presenting a sequence of 
[...]

The previously presented image may comprise each of the images of the
sequence in turn. 

The first means may be arranged to present the sequence of images in 
real time. The subsequently presented image may be the currently 
presented image. The first means may comprise a video camera. 

The first means may comprise a memory for storing the previously 
presented image and the subsequently presented image. 

The sequence of images may comprise consecutive fields of interlaced 
fields. 

The second, third and fourth means may comprise a programmed data 
processor. 

The fourth means is preferably arranged to add the movement
determined by the third means to the position determined by the second
means.



The third means may be arranged to determine the movement as soon as 
the subsequently presented image has been presented by the first means. 
The second means may be arranged to determine the position of the 
target image in the subsequently presented image as soon as the third 
means has determined the movement. 

The second means may be arranged to search for the target image in a 
first image portion which is smaller than the images of the sequence and 
which includes the position indicated by the fourth means. The position 
[...] smaller than the target image. The third means may be arranged to
determine translational movement of the target image. The third means 
may be arranged to solve a set of equations: 

$$f_2(x_i, y_i) = f_1(x_i, y_i) + \frac{\partial f_1(x_i, y_i)}{\partial x}\,\Delta x + \frac{\partial f_1(x_i, y_i)}{\partial y}\,\Delta y$$

where $x_i$ and $y_i$ are Cartesian coordinates of an ith image element, i is
each integer such that 1 ≤ i ≤ j and j is an integer greater than one, $f_1$ and $f_2$
are functions representing the previously and subsequently presented
images and $\Delta x$ and $\Delta y$ are the Cartesian components of the movement.
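
Purely by way of illustration, such a set of equations may be solved in a least squares sense. The following Python sketch assumes the first-order form in which the subsequently presented image equals the previously presented image plus its x and y derivatives multiplied by the respective movement components. It approximates the derivatives by finite differences using NumPy. The function name estimate_movement is hypothetical, and the sign convention follows that assumed form rather than any particular implementation.

```python
import numpy as np


def estimate_movement(f1, f2):
    """Least squares estimate of (dx, dy) from two small image regions.

    Solves f2 - f1 = (df1/dx) * dx + (df1/dy) * dy over all pixels.
    """
    f1 = f1.astype(float)
    f2 = f2.astype(float)
    # np.gradient returns derivatives along axis 0 (rows, y) then axis 1 (columns, x).
    grad_y, grad_x = np.gradient(f1)
    a = np.column_stack([grad_x.ravel(), grad_y.ravel()])
    b = (f2 - f1).ravel()
    (dx, dy), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(dx), float(dy)
```

Because the expansion is only first order, the estimate is reliable for movements that are small compared with the scale of the image features.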

According to a second aspect of the invention, there is provided an 
observer tracking autostereoscopic display including a system in 
accordance with the first aspect of the invention. 

The first means may comprise a video camera whose optical centre is 
disposed at an optical centre of the display. 

According to a third aspect of the invention, there is provided an image 
tracking method for sequentially presented images, comprising 
determining the position of a target image in a previously presented 
image, determining movement of the target image between the 
previously presented image and a subsequently presented image, and 
indicating the position of the target image in the subsequently presented 
image as the position in the previously presented image modified by the
determined movement. 
Figure 1 is a schematic diagram of a known type of observer tracking
autostereoscopic display; 

Figure 2 is a diagrammatic plan view of a specific known type of 
autostereoscopic display; 

Figure 3 is a diagram illustrating the generation of viewing windows in
autostereoscopic displays; 

Figure 4a is a diagrammatic plan view illustrating the generation of 
viewing zones in autostereoscopic displays; 

Figure 4b illustrates the desired relative positions of viewing windows 
and the eyes of an observer for horizontal and horizontally tilted eye 
positions;

Figure 5 is a graph illustrating a typical intensity profile of a viewing 
window of an autostereoscopic display; 

Figure 6 is a diagram illustrating discrete positions of a viewing window 
of an autostereoscopic display; 

Figure 7 is a diagrammatic plan view illustrating the generation of 
viewing zones in an autostereoscopic display; 

Figure 8 is a graph illustrating observer tracking response time as a 
function of maximum observer head speed; 
Figure 9 illustrates an image frame composed of interlaced odd and even
fields;



Figure 10 is a diagram illustrating the timing of field digitisation and 
processing; 



Figure 11 is a diagram illustrating the use of a ring buffer for
simultaneous digitisation and processing; 

Figure 12 is a flow diagram of a known template matching technique; 

Figure 13 illustrates template matching of an image and a suitable 
template; 

Figure 14 is a diagram illustrating the number of iterations required for 
template matching throughout a whole image; 

Figure 15 illustrates a two dimensional surface representing cross- 
correlation values for different image positions; 

Figure 16 is a schematic diagram illustrating an observer tracking display 
and a tracking system constituting embodiments of the invention; 

Figure 17 is a general flow diagram illustrating an image tracking method
constituting an embodiment of the invention; 

Figure 18 is a more detailed flow chart of the method illustrated in 
Figure 17; 
Figure 19 illustrates the appearance of a display during template capture;

Figure 20 illustrates the limited region of the image for which template 
matching is performed; 

Figure 21 illustrates hierarchical template matching; 

Figure 22 is a diagram illustrating differential movement determination; 

Figure 23 is a diagram illustrating the timing of the method illustrated in 
Figures 17 and 18;

Figure 24 is a diagrammatic plan view illustrating a preferred position of 
a video camera with respect to a 3D display; 

Figure 25 illustrates an alternative technique for template matching; and 

Figure 26 is a diagram illustrating hue (H), saturation (S), value (V) space. 

Like reference numerals refer to like parts throughout the drawings. 

Figure 16 shows an observer tracking autostereoscopic display 
constituting an embodiment of the invention and including a video 
image tracking system also constituting an embodiment of the invention. 
The tracking system 2 shown in Figure 16 differs from that shown in 
Figure 1 in that the tracking sensor 3 comprises a Sony XC999 NTSC 
camera operating at a 60 Hz field rate and the tracking processor 4 is 
provided with a mouse 60 and comprises a Silicon Graphics entry level 
machine of the Indy series equipped with an R4400 processor operating 
at 150 MHz and a video digitiser and frame store having a resolution of
640x240 picture elements (pixels) for each field captured by the camera 
3. The camera 3 is disposed on top of the 3D display 7 and points 
towards the observer 8 who sits in front of the display. The normal 
distance between the observer 8 and the camera 3 is about 0.85 metres 
at which distance the observer has a freedom of movement in the lateral 
or X direction of about 500 millimetres. The distance between two
pixels in the image formed by the camera 3 corresponds to about 0.7
and 1.4 millimetres in the X and Y directions, respectively, the Y
resolution being halved because each interlaced field is individually
used. The template described hereinafter is selected to have 150x50
pixels, corresponding to a region of about 105x70 millimetres. The
mouse 60 is used during template capture as described hereinafter. The 
camera 3 captures and presents to the processor 4 a continuous 
sequence of images of the user under ambient lighting. 

Figure 17 illustrates in general terms the tracking method performed by 
the processor 4. In an initialisation stage, a template comprising a target
image is captured interactively at step 61. Following the initialisation
stage, a tracking stage begins with a global template search at step 62.
This is followed by a movement detection step 63 and a local target
search 64. A step 65 checks whether tracking has been lost. If so,
control returns to step 62 to perform another global template search. If 
tracking has not been lost, control returns to the motion detection step 
63. Thus, steps 63 to 65 form a tracking loop which is performed for as 
long as tracking is maintained. The motion detection step 63 supplies 
position data as indicated at 66 by a differential movement method 
which determines the movement of the target image between 
consecutive fields and adds this to the position found by local template
matching in the preceding step 64 for the earlier of the fields. 
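
The control flow of Figure 17 may be sketched, purely by way of illustration, as the following Python generator. The three callables global_search, detect_motion and local_search are hypothetical placeholders for steps 62, 63 and 64 and are not defined by the present description. The threshold of 0.5 is an assumption corresponding to a typical value for normalised cross-correlation, and applying it to the global search as well as to step 65 is likewise an assumption.

```python
def track(fields, global_search, detect_motion, local_search, threshold=0.5):
    """Yield one (x, y) position estimate per field once the target is found.

    global_search(field) -> (score, (x, y))            # step 62 (hypothetical)
    detect_motion(previous, field) -> (dx, dy)         # step 63 (hypothetical)
    local_search(field, (x, y)) -> (score, (x, y))     # step 64 (hypothetical)
    """
    position = None   # position from template matching in the earlier field
    previous = None   # the earlier field itself
    for field in fields:
        if position is None:
            # Step 62: search the whole field for the template.
            score, candidate = global_search(field)
            if score >= threshold:
                position, previous = candidate, field
            continue
        # Step 63: movement between consecutive fields, added to the
        # position found by template matching for the earlier field.
        dx, dy = detect_motion(previous, field)
        estimate = (position[0] + dx, position[1] + dy)
        yield estimate  # position data supplied at 66
        # Step 64: local template search around the estimate.
        score, refined = local_search(field, estimate)
        # Step 65: a weak match means tracking is lost, so return to step 62.
        position = refined if score >= threshold else None
        previous = field
```

The estimate is yielded before the local search runs, which reflects the point that the position data become available as soon as the movement has been determined.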

Figure 18 illustrates the tracking method of Figure 17 in more detail. 
The interactive template capture step 61 makes use of the display 7 and 
the mouse 60 to allow the user to select the target image which is to 
form the template. During this mode of operation, as shown in Figure 
19, the display 7 displays an image of the observer 8 as captured by the 
video camera 3. The processor 4 overlays the image with a graphical 
guide 67 of the required template size and with text indicating that the 
observer should place himself so that his eyes are inside the rectangle 67 
on the display 7 and aligned with the middle line 68. When the 
observer has correctly positioned himself with respect to the graphical 
guide 67, he operates a button of the mouse 60 so that the processor 4 
captures and stores the part of the image of the observer inside the 
graphical guide 67 for use as a template or target image. 

Alternatively, the mouse may be used to drag the graphical guide 67 so 
that it is correctly aligned with the observer's eyes, after which the 
mouse button is pressed to store the target image. 

An advantage of the interactive template capture 61 is that the observer 
is able to make the decision on the selection of the template with 
acceptable alignment accuracy. This involves the recognition of the 
human face and the selection of the interesting image region, such as the 
eye region. Whereas human vision renders this process trivial, template 
capture would be difficult for a computer, given all possible types of 
people with different age, sex, eye shape and skin colour under various 
lighting conditions. In fact, template capture can be performed for 
[...] determined during the previous template matching step so that the
position data X0 and Y0 output by a step 78 to the window steering
mechanism 6 via the display control processor 5 is formed as X0 + ΔX,
Y0 + ΔY.

After the step 78, a step 79 applies template matching to the image f2 as
described hereinafter. In particular, a hierarchical search is applied to a
small region of the image centred at the position X0, Y0. The template
matching involves a cross-correlation technique and a step 80 detects
whether tracking is lost by comparing the best correlation obtained in the
step 79 with a preset threshold. If the best correlation is less than the
preset threshold, control returns to the step 72 of the global template
search 62 so as to relocate the position of the target image within the
next available digital image. If the best correlation is greater than the
preset threshold, step 81 updates the position data by entering the
position of the best correlation as the parameters X0 and Y0. Control
then returns to the step 75 and the steps 75 to 81 are repeated for as
long as observer tracking is required.

The template matching step 73 is of the type described hereinbefore with 
reference to Figures 12 to 14. It is necessary to locate the target image 
within the whole image area as an initial step and whenever tracking is 
lost as determined by step 80. In particular, the differential movement 
detection method 76 cannot begin until the position of the target image 
within the whole image is known. 

In a preferred embodiment, template matching is performed by cross- 
correlating the target image in the template with each subsection overlaid 
by the template as described with reference to Figure 14. The similarity 
between the template and the current subsection of the image may be
measured using normalised cross-correlation. The normalised cross-correlation
$C(x_0, y_0)$ for the coordinates $x_0, y_0$ at the top left corner of the image area
to be matched with the template is calculated as follows:

$$C(x_0, y_0) = \frac{\frac{1}{M}\sum_{(x,y)} f(x + x_0, y + y_0)\,Tp(x, y)}{\sqrt{C_f\,C_{Tp}}}$$

where f(x,y) is a function representing the incoming image;
Tp(x,y) is a function representing the target image of the template;
(x,y) are the coordinates of each pixel in the template;
M is the total number of pixels in the template;
$C_f$ is the autocorrelation of the image f(x,y); and
$C_{Tp}$ is the autocorrelation of the template Tp(x,y).

The autocorrelations are given by the expressions:

$$C_f = \frac{1}{M}\sum_{(x,y)} f(x + x_0, y + y_0)^2$$

$$C_{Tp} = \frac{1}{M}\sum_{(x,y)} Tp(x, y)^2$$

The values of the cross-correlation $C(x_0, y_0)$ are in the range of [0.0, 1.0],
where the maximum value 1.0 is achieved when the template is identical
to the underlying image.
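
Purely by way of illustration, the normalised cross-correlation at a single position may be computed as in the following Python sketch. It assumes the usual normalised form, in which the mean product of image and template pixels is divided by the square root of the product of the two autocorrelations, and it assumes NumPy arrays indexed as [row, column]. The function name is hypothetical.

```python
import numpy as np


def normalised_cross_correlation(image, template, x0, y0):
    """Normalised cross-correlation with the template's top left corner at (x0, y0)."""
    height, width = template.shape
    patch = image[y0:y0 + height, x0:x0 + width].astype(float)
    tp = template.astype(float)
    m = tp.size                        # total number of pixels in the template
    c_f = (patch ** 2).sum() / m       # autocorrelation of the image area
    c_tp = (tp ** 2).sum() / m         # autocorrelation of the template
    if c_f == 0.0 or c_tp == 0.0:
        return 0.0                     # avoid dividing by zero for blank regions
    return float((patch * tp).sum() / m / np.sqrt(c_f * c_tp))
```

For non-negative pixel intensities the returned value lies between 0.0 and 1.0, reaching 1.0 when the image area is identical to the template.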
The same template matching technique is used in the step 79 but, in this
case, template matching is applied to a relatively small region within the 
currently available image as illustrated in Figure 20. The target image 
position located in the previous image field is indicated at 85. During 
the time interval between consecutive video image fields, the maximum 
movement which can occur is limited by the maximum speed of 
movement of the observer. In the specific example of hardware 
described hereinbefore, this corresponds to a maximum vertical or 
horizontal movement 86 of about 8.33 millimetres. Accordingly, the eye 
region of the observer must be within a boundary 87 having the same 
(rectangular) shape as the template and concentric therewith but taller 
and wider by 16.67 millimetres. The template matching step 79 is thus 
constrained within the boundary 87 so as to minimise the computing 
time. 
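
As a worked example, the size of the boundary 87 follows from the maximum head speed, the field rate and the pixel spacing. The Python sketch below assumes the values given hereinbefore (500 millimetres per second, 60 fields per second, about 0.7 and 1.4 millimetres per pixel, and a 150 by 50 pixel template). The function name is hypothetical, and rounding the margin up to a whole number of pixels is an assumption.

```python
import math


def search_margin(max_speed_mm_s, field_rate_hz, mm_per_px_x, mm_per_px_y):
    """Return (movement per field in mm, margin in pixels along X, along Y)."""
    move_mm = max_speed_mm_s / field_rate_hz
    # Round up so that the boundary never excludes a reachable position.
    return move_mm, math.ceil(move_mm / mm_per_px_x), math.ceil(move_mm / mm_per_px_y)


move_mm, margin_x, margin_y = search_margin(500.0, 60.0, 0.7, 1.4)
boundary_width_px = 150 + 2 * margin_x
boundary_height_px = 50 + 2 * margin_y
print(f"movement per field: {move_mm:.2f} mm")
print(f"boundary: {boundary_width_px} x {boundary_height_px} pixels")
```

Restricting the search to a region only slightly larger than the template removes most of the candidate positions that a whole-field search would have to visit.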

In order further to optimise the template matching of the steps 73 and 
79, an optimised hierarchical template matching technique is adopted as 
illustrated in Figure 21. Template matching is performed in first and 
second sub-steps. In the first sub-step, template matching is performed at 
sparsely displaced positions within the whole image for the step 73 or 
within the boundary 87 for the step 79. Instead of using all pixels of the 
template 47 to perform the cross-correlation with the underlying image 
section, only the image elements such as 88 and 89 at the intersections 
of a relatively coarse grid of lines are used so that the template and the 
underlying image region are subsampled to reduce the volume of data 
which has to be processed in the cross-correlation calculation. Further, 
the data representing each image element may be truncated so as to 
reduce the calculation time. 
The last detected position of the target image is indicated at 90. When
the first sub-step of the template matching is complete, it is, for example, 
found that the maximum correlation occurs for the target image centred 
on the position 91. 

The second sub-step is then performed so as to refine the new position of 
the target image. The same cross-correlation calculations are performed 
but, in this case, the search is confined to a smaller region 92, each of 
whose dimensions is twice the sparse step in the first sub-step. This sub- 
step is performed with a finer step and higher image element resolution 
and results in a refined position 93 being found for the target image in 
the currently processed image field. 
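
The two sub-steps may be sketched, purely by way of illustration, as follows. The Python code assumes NumPy and a normalised correlation score. The parameter names coarse_step and subsample are hypothetical, as are their default values, and the optional truncation of the image element data is not modelled.

```python
import numpy as np


def ncc(patch, tp):
    """Normalised correlation of two equally sized arrays."""
    denom = np.sqrt((patch * patch).sum() * (tp * tp).sum())
    return float((patch * tp).sum() / denom) if denom > 0 else 0.0


def hierarchical_match(image, template, coarse_step=4, subsample=2):
    """Coarse-to-fine template search; returns (score, row, col)."""
    image = image.astype(float)
    template = template.astype(float)
    h, w = template.shape
    max_r, max_c = image.shape[0] - h, image.shape[1] - w

    # First sub-step: sparsely displaced positions, subsampled pixels.
    tp_sub = template[::subsample, ::subsample]
    best = (-1.0, 0, 0)
    for r in range(0, max_r + 1, coarse_step):
        for c in range(0, max_c + 1, coarse_step):
            patch = image[r:r + h:subsample, c:c + w:subsample]
            best = max(best, (ncc(patch, tp_sub), r, c))
    _, r0, c0 = best

    # Second sub-step: every position in a region whose dimensions are
    # twice the sparse step, centred on the coarse result, using all pixels.
    best = (-1.0, r0, c0)
    for r in range(max(0, r0 - coarse_step), min(max_r, r0 + coarse_step) + 1):
        for c in range(max(0, c0 - coarse_step), min(max_c, c0 + coarse_step) + 1):
            best = max(best, (ncc(image[r:r + h, c:c + w], template), r, c))
    return best
```

The coarse step must remain small relative to the width of the correlation peak, since otherwise the first sub-step may select a position whose neighbourhood does not contain the true match.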

Although the template matching step may be divided into more than two 
sub-steps, it has been found that, in practice, a two sub-step arrangement 
is adequate in terms of efficiency and accuracy. This is because, in the 
first sub-step, the step size between two neighbouring positions cannot 
be too large as otherwise it might easily miss the true "coarse" position. 

As described hereinbefore, the steps 74 and 80 compare the best cross- 
correlation values obtained in the steps 73 and 79, respectively, with a 
preset threshold to determine whether the target image is present in the 
current image (in the case of the step 74) or is present within the 
boundary 87 of the image (for the step 80). In theory, the cross-
correlation value at the best-matched position would be 1 if:

the head movement were translational;
there were no intensity change;
the camera were linear without any defect of optics;
there were no random noise from the electronic circuitry; and
there were no digitisation errors.

In practice, these conditions cannot be satisfied so that the template will 
not find a perfect match in the image. The best cross-correlation value is 
therefore compared with the preset threshold to establish whether an 
acceptable match has been found and to prevent the system from locking 
onto inappropriate image portions of relatively low cross-correlation 
value. The preset threshold is determined heuristically by experimenting 
with a large number of people of different types under various lighting 
conditions. A typical value for the threshold is 0.5 for the case where 
normalised cross-correlation with a maximum value of 1.0 is used. 
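
For illustration only, one common definition of normalised cross-correlation is given below. The specification does not state which formula is used, so the mean-subtracted form and the symbols T (template), I (image field), (u,v) (candidate position) and C are assumptions.

```latex
C(u,v) = \frac{\sum_{x,y} \left[T(x,y) - \bar{T}\right]\left[I(u+x, v+y) - \bar{I}_{u,v}\right]}
              {\sqrt{\sum_{x,y} \left[T(x,y) - \bar{T}\right]^{2} \; \sum_{x,y} \left[I(u+x, v+y) - \bar{I}_{u,v}\right]^{2}}}
```

By the Cauchy-Schwarz inequality C(u,v) cannot exceed 1, and under this definition a candidate position would be accepted when C(u,v) is at least the preset threshold, for example 0.5.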

Although the template 47 may have any shape, a rectangular shape is 
preferred because it is easier for computer processing. The size of the 
template is important in that it can affect the computing efficiency and 
accuracy in determining the position of the target image. Larger 
templates tend to produce a sharper peak correlation so that the peak 
position can be determined more accurately, mainly because a larger 
image region contains more features of the face so that a small 
movement away from the peak position would change the cross- 
correlation value more substantially. However, a larger template requires 
more computing time. Also, the template should not exceed the 
boundary of the face of the observer in the image so as to prevent 
template matching from being affected by the background content of the 
images. A typical size for balancing these factors is one which is just 
large enough to cover the two eye regions of the observer in the images. 
For the parameters described hereinbefore, a template size of 150 by 50 
image picture elements is suitable.

Although a point midway between the eyes of the observer is used to
control steering of the viewing windows, other positions of the template
such as the corners may be used if the calibration referred to
hereinbefore is performed by a human observer. An offset is implicitly
and automatically included in the results of the calibration.

The differential method 76 measures the relative movement between two
consecutive image fields and assumes that this movement has been
purely translational. The movement is determined from the intensity
difference of the two consecutive image fields using the well-known
Taylor approximations, for instance as disclosed in T.S. Huang (Editor),
Image Sequence Analysis, ISBN 3-540-10919-6, 1983. If f_1(x,y) denotes
the intensity of an image feature in the first of the consecutive fields
at a point (x,y) and the image feature moves by (Δx,Δy) so that it is at
position (x+Δx, y+Δy) in the second field, then the grey level
f_2(x+Δx, y+Δy) of the second frame has the same grey level, i.e.:

$$f_1(x, y) = f_2(x + \Delta x, y + \Delta y)$$

If the amount of motion is small, the right hand side of the above
equation can be approximated using the Taylor expansion truncated to
the first order differential terms as follows:

$$f_1(x, y) = f_2(x, y) + \Delta x \frac{\partial f_2(x, y)}{\partial x} + \Delta y \frac{\partial f_2(x, y)}{\partial y}$$

In this equation, Δx and Δy are the unknowns representing the movement.
Thus, a pair of pixels from two consecutive images produces one
equation so that two pairs of pixels produce a linear system of two
equations, which can be solved for the two unknowns to give the
amount of movement. In practice, a larger number of pixels is used to
reduce random errors using the well-known least squares method. For
instance, 50 to 60 pairs of image picture elements may be used.
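
The least squares solution described above can be sketched as follows. This is a minimal sketch, assuming NumPy, central-difference derivatives from `np.gradient` and a list of (row, column) sample points; the function name `estimate_shift` is hypothetical and none of these choices is taken from the specification.

```python
import numpy as np

def estimate_shift(f1, f2, points):
    """Least-squares (dx, dy) from f1 = f2 + dx * df2/dx + dy * df2/dy at the given points.

    f1, f2 : 2-D arrays holding the previous and the next field.
    points : iterable of (row, col) positions, e.g. 50 to 60 of them.
    """
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    gy, gx = np.gradient(f2)                  # derivatives along rows (y) and columns (x)
    rows = np.array([[gx[y, x], gy[y, x]] for (y, x) in points])
    rhs = np.array([f1[y, x] - f2[y, x] for (y, x) in points])
    # Over-determined linear system: one equation per pair of picture elements.
    (dx, dy), *_ = np.linalg.lstsq(rows, rhs, rcond=None)
    return float(dx), float(dy)
```

Each sample point contributes one row of the linear system, so using many more than two points averages out random errors in the individual equations.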

The pairs of elements should be selected from the target image, for
instance the eye regions of the observer. However, the actual position of
the target is not known before the amount of movement between
consecutive fields has been determined. In practice, because there is a
limit to the actual amount of movement between two consecutive fields as
described hereinbefore with reference to Figure 20, it is possible to
choose the pairs of pixels from a region 94 shown in Figure 20 which
will always contain parts of the head in consecutive images
irrespective of the direction of movement but provided the speed of
movement of the observer is less than the designed maximum speed.

In order to illustrate the differential method more clearly, a
one-dimensional example is shown in Figure 22. In this case, the Taylor
approximation reduces to:

$$f_1(x) = f_2(x) + \Delta x \frac{d f_2(x)}{dx}$$

This may be rewritten as:

$$\Delta f = \Delta x \frac{d f_2(x)}{dx}$$

where:

$$\Delta f = f_1(x) - f_2(x)$$

The curves 95 and 96 shown in Figure 22 ...

... where the summation is over the 3x3 window centred at the current
pixel. The minimisation is achieved when the partial derivative with 
each parameter is zero. This provides a system of equations which can 
easily be solved. The partial derivatives of f(x,y) are then calculated as:

$$\frac{\partial f(x,y)}{\partial x} = 2ax + cy + d$$

$$\frac{\partial f(x,y)}{\partial y} = 2by + cx + e$$

For a 3x3 window, the final expressions may be represented by filters
which are the conventional Prewitt edge detectors.
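
The Prewitt-type derivative filters can be sketched as follows. The kernels below are the conventional Prewitt masks, scaled by 1/6 so that they return the first-order coefficients of a least-squares quadratic fit over a 3x3 window with unit spacing. The scaling, the function name `prewitt_gradients` and the zero border are assumptions for illustration and are not taken from the specification.

```python
import numpy as np

# Conventional Prewitt masks; the 1/6 factor (an assumption) gives the
# least-squares slope for unit picture-element spacing.
PREWITT_X = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]]) / 6.0
PREWITT_Y = PREWITT_X.T

def prewitt_gradients(img):
    """Return (df/dx, df/dy) at interior picture elements; borders are left at zero."""
    img = np.asarray(img, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    h, w = img.shape
    for dy in range(3):
        for dx in range(3):
            # Neighbour at offset (dy - 1, dx - 1) for every interior element.
            window = img[dy:h - 2 + dy, dx:w - 2 + dx]
            gx[1:-1, 1:-1] += PREWITT_X[dy, dx] * window
            gy[1:-1, 1:-1] += PREWITT_Y[dy, dx] * window
    return gx, gy
```

For an intensity ramp such as 2x + 3y + 5 this sketch returns 2 and 3 at every interior picture element.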

The differential method 76 is computationally efficient and, in the 
specific embodiment as described herein, requires only about 2 
milliseconds while providing suitable accuracy so that the output of 
position data supplied at step 78 may be used to control the steering 
mechanism 6 of the autostereoscopic display. However, the motion
detection method cannot be used on its own to achieve robust tracking
because there are always measuring errors in the detected movement.
Merely repeating the motion detection method without intervening
correction causes errors to accumulate so that accurate tracking is rapidly
lost. The template matching step 79 is necessary to provide an accurate
position measurement for each consecutive field and the errors occurring
in the differential method 76 for a single iteration are too small to affect
accurate observer tracking. The target verification provided by the step
79 verifies that, at the detected position, the target image is indeed there
and also refines the position data before the next motion detection step
76. As mentioned hereinbefore, the motion detection step takes about 2
milliseconds which, in the case of a field repetition rate of 60 Hz, leaves
about 14.7 milliseconds before the next digitised image is ready. This
"waiting time" is sufficient for the template matching step 79 to perform
the target verification.

The step 77 adds the movement determined in the step 76 to the
position determined in the step 79 for the previous image field. Thus,
accumulative errors in the differential method step 76 are avoided since
the results of only one differential method step 76 (with only one set of
errors) are used to indicate the target image position at the step 78.
Although the template matching step 79 calculates a target image
position containing errors, such errors do not accumulate because the
template is matched each time to the current image of the target. The
resulting position data is thus always the true position of the target plus a
single measuring error.
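
The way in which the steps 76 to 79 combine can be sketched as a loop. This is a simplified sketch under stated assumptions: `measure_motion` and `refine_by_template` are hypothetical callables standing in for the differential method and the template matching, positions are (x, y) tuples, and the threshold test of the step 80 is omitted.

```python
def track(fields, initial_pos, measure_motion, refine_by_template):
    """Return the position reported for each field after the first one.

    fields             : sequence of image fields
    initial_pos        : position from the initial global template search
    measure_motion     : callable(prev_field, cur_field) -> (dx, dy)
    refine_by_template : callable(cur_field, predicted_pos) -> refined (x, y)
    """
    ptm = initial_pos                 # last position found by template matching
    reported = []
    prev = fields[0]
    for cur in fields[1:]:
        pmd = measure_motion(prev, cur)               # step 76: movement between fields
        pos = (ptm[0] + pmd[0], ptm[1] + pmd[1])      # step 77: Ptm + Pmd
        reported.append(pos)                          # step 78: output position data
        ptm = refine_by_template(cur, pos)            # step 79: verify and refine
        prev = cur
    return reported
```

Because each reported position is built from a freshly matched position plus a single motion estimate, a constant error in `measure_motion` appears once in every output instead of growing from field to field.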






The motion detection and template matching work together in an
efficient way. The motion detection produces position data quickly so
that the time latency is as short as possible. The result of motion
detection confines the search area for template matching. The use of 
template matching confirms the target position and prevents the 
accumulation of measuring errors due to motion detection. This efficient 
combination makes it possible to produce a reliable tracking system 
which is suitable for observer tracking autostereoscopic 3D displays in 
that it satisfies the requirements of short-time latency, high update 
frequency and sufficient measurement accuracy. 

Figure 23 illustrates the timing of the steps 76 and 79 in relation to the 
timing of digitisation of the sequence of image fields. Digitisation starts 
at the time indicated at 97 and, using the NTSC video camera with a 60 
Hz field rate as shown in Figure 16, each field is digitised in a period of 
16.67 milliseconds. The processor 4 contains a ring buffer of the type 
illustrated in Figure 11 so that each field is available from its respective
buffer memory whilst the subsequent field is being captured. 

Assuming that computing starts at the time indicated at 98 and the global
template search 62 has already been performed to give an initial position
of the target image, the step 76 is performed in about 2 milliseconds and
the position data Pmd is then available for adding to the position Ptm
obtained in the preceding step 79 by template matching. Thus, the step
76 begins immediately after a fresh field has been captured and digitised.
The step 79 begins immediately after the step 76 has been completed and
takes about 10 milliseconds. Thus, the whole image tracking for each
image field is completed within the time required to digitise an image
field so that the repetition rate of the observer position measurements is
equal to the field repetition rate of 60 Hz. The latency of position
measurements is 18.7 milliseconds and an X accuracy of better than 5
millimetres can be obtained. This is sufficient for the ...

The global template search 62 takes a fixed time of ... into the field of
view and is located again.

The differential method 76 works well when the motion between two
consecutive fields does not exceed 3 to 4 image picture elements. This is
about the average speed of an observer for a typical embodiment of
the autostereoscopic display shown in Figure 2 but, in practice, the
observer may move twice as fast as this from time to time. In such
cases, the differential method may be applied iteratively. For instance,
if the target image has moved by 7 pixels between consecutive fields but
the first iteration estimates a movement of 4 pixels, the second image
field can be shifted by 4 pixels before another iteration. The relative
movement between the shifted field and the previous field is now about 3
pixels which may be accurately measured. Although more than two
iterations may be used, two iterations have been found to be adequate in
practice.
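
The iteration described above can be sketched in one dimension as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the derivative is taken with `np.gradient`, and the second field is shifted by linear interpolation (`np.interp`) so that fractional estimates can be used, whereas the text describes a shift by a whole number of pixels.

```python
import numpy as np

def estimate_1d(f1, f2, idx):
    """Least-squares dx from f1(x) = f2(x) + dx * df2/dx over the sample indices idx."""
    g = np.gradient(f2)
    return float((g[idx] * (f1[idx] - f2[idx])).sum() / (g[idx] ** 2).sum())

def iterative_estimate(f1, f2, idx, iterations=2):
    """Repeat the differential estimate, shifting the second field by the running total."""
    x = np.arange(len(f2), dtype=float)
    total = 0.0
    for _ in range(iterations):
        # Shift the second field back by the movement estimated so far, so that
        # only the residual movement remains to be measured.
        shifted = np.interp(x + total, x, f2)
        total += estimate_1d(f1, shifted, idx)
    return total
```

Each pass measures only the residual movement left after the previous shift, which is why the second pass operates in the small-movement regime where the first-order approximation holds.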

The observer position may be expressed in terms of Cartesian coordinates
as illustrated in Figure ... . The tracking system described hereinbefore
implements XY tracking ... position data alone may be sufficient for
steering the viewing windows if
the observer does not move in the Z direction but remains in the 
window plane 23 of the display. Usually, Z tracking is required if the 
observer is to move in the Z direction. However, for a special case 
where the optical centre of a lens 99 of the camera 3 is aligned with the 
optical centre 29 of the display, Z tracking is not explicitly required. 

As illustrated in Figure 7, the viewing zones 28 are angularly spaced in 
the horizontal XZ plane. With the optical centre of the camera lens 99 
aligned with the optical centre 29 of the display as shown in Figure 24, 
all points on the same switching line such as 100 are imaged to the same 
point such as 101 on the image plane 102 of the camera 3. The X
position of the camera pixel image therefore indicates the angular 
position of the observer and can be supplied directly to the optical 
system of the display 7 to provide correctly steered viewing windows 
without requiring any knowledge of the Z position of the observer. This
particular arrangement therefore allows the use of a single camera with
increased accuracy and shortened response time so that the tracking 
system is particularly suitable for this type of autostereoscopic display. 

As illustrated in Figure 4, autostereoscopic displays typically allow some 
longitudinal viewing freedom for the observer. In particular, so long as 
the eyes of the observer remain in the appropriate diamond shaped 
viewing zones 21 and 22, the observer will perceive a 3D image across 
the whole of the display. However, movement in the longitudinal or Z 
direction causes a change of the size of the target image as the observer 
moves towards and away from the video camera 3. The differential 
method 76 uses the latest two consecutive image fields so that the target
may only move a small distance. The scaling effect is therefore minimal
and does not cause a serious problem.

The scaling effect is more important to the template matching steps 73
and 79 because each image field is always searched with a fixed
template acquired during the template capture step 61. The result of the
scaling effect is that the maximum correlation is lower than the optimum
value of 1. For the specific embodiment described hereinbefore, the
tracking system can tolerate longitudinal movement of about 150
millimetres forwards or backwards from the ... metre with a measuring
error not exceeding 5 millimetres.

As described hereinbefore, a preset threshold is used in the steps 74 and
80 to test whether the target image has been located. The preset
threshold has to be sufficiently small so that it can accommodate
different people at different positions and orientations under various
lighting conditions. However, the threshold should be sufficiently large
so that it can discriminate a true target image from a false one. Target
verification may be enhanced by further checking using additional
methods such as hue (H) saturation (S) value (V) measurement as
illustrated in Figures 25 and 26.

The image tracking method shown in Figure 25 differs from that shown
in Figures 17 and 18 in that steps 103 and 104 are inserted. It has been
found that the saturation of the images is less affected by different
lighting conditions than other image features, such as grey level. With
uniform illumination, both hue and saturation of the face change
smoothly over a large portion of the face. Even with non-uniform
illumination, the saturation of the image remains fairly balanced on both
sides of the observer face whereas the intensity picture could be visibly
dark on one side and bright on the other. Also, the average saturation 
value over the observer face region differs from that of the background 
more significantly. The mean value of the saturation over the observer 
face changes very little during head movement. This therefore provides 
an additional check for target verification. 

Figure 26 illustrates HSV as a double cone containing all possible 
colours of light. The axis of the double cone represents a grey scale 
progression from black to white. Distance from the axis represents 
saturation. Angular direction around the axis represents hue. 

In the case of a video camera 3 providing red, green and blue (RGB) 
outputs, conversion to HSV format may be obtained by finding the 
maximum and minimum values of the RGB signals. The V component is 
then given by the value of the maximum signal. The saturation S is 
defined as zero when the V component is zero and otherwise as the 
difference between the maximum and minimum RGB values divided by 
the maximum value. The hue is computed as follows:

$$H = \begin{cases} 0 & \text{for } V = 0 \\ (G - B)/d & \text{for } \mathrm{cmax} = R \\ 2 + (B - R)/d & \text{for } \mathrm{cmax} = G \\ 4 + (R - G)/d & \text{for } \mathrm{cmax} = B \end{cases}$$

where d is the difference between the maximum and minimum RGB
values and cmax is the maximum of the RGB values.

If the peak correlation value exceeds the threshold, the step 103 converts
the region of the image field underlying the template at the optimum
position from RGB to HSV format as described hereinbefore. The mean
value s1 of the target image is calculated and compared with the mean
saturation of the template (which is fixed and ... during the template
capture step 61). The difference S in mean saturation is calculated as
|s1 - s2|/s2 and is compared ... another predetermined threshold ...
typical value between 0.7 and ...
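
The RGB to HSV conversion and the mean saturation described above can be sketched as follows. This is only a sketch: the function names are hypothetical, the hue is left in the sextant units produced by the formula, and the value returned for a grey picture element (d = 0 with V not zero) is an assumption because the text does not cover that case.

```python
def rgb_to_hsv(r, g, b):
    """Convert one RGB picture element to (H, S, V) using the formulas in the text."""
    cmax = max(r, g, b)
    cmin = min(r, g, b)
    d = cmax - cmin                    # difference between maximum and minimum
    v = cmax                           # V is the maximum signal
    s = 0.0 if v == 0 else d / v       # S is zero when V is zero
    if v == 0 or d == 0:
        h = 0.0                        # d == 0 (grey) is an assumed convention
    elif cmax == r:
        h = (g - b) / d
    elif cmax == g:
        h = 2 + (b - r) / d
    else:
        h = 4 + (r - g) / d
    return h, s, v

def mean_saturation(pixels):
    """Mean S over an iterable of (r, g, b) picture elements, e.g. the template region."""
    pixels = list(pixels)
    return sum(rgb_to_hsv(*p)[1] for p in pixels) / len(pixels)
```

For example, `mean_saturation` could be evaluated once for the captured template and once per field for the region under the best-matched position, giving the two mean saturation values that are compared.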

Various modifications may be made within the scope of the invention.
... special purpose silicon chip ...

Such a camera may have a first output for a standard video signal and a
second output for position data. A system integrated directly with the
camera sensor would not require digital/analogue and analogue/digital
converters to generate and process analogue video data, thus providing
an opportunity of reducing cost and improving performance.

CLAIMS

1. An image tracking system comprising first means for presenting a
sequence of images, second means for determining the position of a 
target image in a previously presented image from the first means, third 
means for determining movement of the target image between the 
previously presented image and a subsequently presented image from the 
first means, and fourth means for indicating the position of the target 
image in the subsequently presented image as the position determined 
by the second means modified by the movement determined by the third 
means. 

2. A system as claimed in Claim 1, in which the subsequently 
presented image is consecutive with the previously presented image. 

3. A system as claimed in Claim 1 or 2, in which the previously 
presented image comprises each of the images of the sequence in turn. 

4. A system as claimed in any one of the preceding claims, in which 
the first means is arranged to present the sequence of images in real
time. 

5. A system as claimed in Claim 4, in which the first subsequently 
presented image is the currently presented image. 

6. A system as claimed in Claim 4 or 5, in which the first means 
comprises a video camera.

7. A system as claimed in any one of the preceding claims, in which
the first means comprises a memory for storing the previously presented
image and the subsequently presented image.

8. A system as claimed in any one of the preceding claims, in which
the sequence of images comprises consecutive fields of interlaced fields.

9. A system as claimed in any one of the preceding claims, in which
the second, third and fourth means comprise a programmed data
processor.

10. A system as claimed in any one of the preceding claims, in which 
the fourth means is arranged to add the movement determined by the 
third means to the position determined by the second means. 

11. A system as claimed in any one of the preceding claims, in which 
the third means is arranged to determine the movement as soon as the
subsequently presented image has been presented by the first means. 

12. A system as claimed in Claim 11, in which the second means is
arranged to determine the position of the target image on the
subsequently presented image as soon as the third means has determined
the movement.

13. A system as claimed in any one of the preceding claims, in which 
the second means is arranged to search for the target image in a first
image portion which is smaller than the images of the sequence and
which includes the position indicated by the fourth means.

14. A system as claimed in Claim 13, in which the position indicated
by the fourth means is substantially at the centre of the first image 
portion. 

15. A system as claimed in Claim 13 or 14, in which the second 
means is arranged to search for the target image in the whole of the 
previously presented image if the search in the image portion is
unsuccessful.

16. A system as claimed in any one of Claims 13 to 15, in which the
second means is arranged to search for the target image in the whole of 
an initial previously presented image. 

17. A system as claimed in any one of Claims 13 to 16, in which the 
second means is arranged to perform template matching of the target 
image at a plurality of first positions in the first image portion to find the 
best match. 

18. A system as claimed in Claim 17, in which the second means is 
arranged to perform template matching of the target image at a plurality
of second positions which are more finely spaced than the first positions 
and which are disposed adjacent a position corresponding to the best 
match. 

19. A system as claimed in Claim 17 or 18, in which the second
means is arranged to perform a correlation between the target image and
a respective region corresponding to each of the first and (when present)
second positions and to select the highest correlation.

20. A system as claimed in Claim 19, in which the second means ...

... of one of the images of the sequence ...

22. A system as claimed in Claim 21, in which the fifth means
comprises a display for displaying the sequence of images, an image
generator for generating a border image on the display, and a user
operable control for actuating capture of an image region within the
border image.

23. A system as claimed in Claim 22, in which the fifth means ...

25. A system as claimed in Claim 24, in which the third means is
arranged to determine translational movement of the target image.

26. A system as claimed in Claim 25, in which the third means is
arranged to solve a set of equations:

$$f_1(x_i, y_i) = f_2(x_i, y_i) + \Delta x \frac{\partial f_2(x_i, y_i)}{\partial x} + \Delta y \frac{\partial f_2(x_i, y_i)}{\partial y}$$

where x_i and y_i are Cartesian coordinates of an ith image element, i is
each integer such that 1 ≤ i ≤ j and j is an integer greater than one, f_1
and f_2 are functions representing the previously and subsequently
presented images and Δx and Δy are the Cartesian components of the
movement.

27. An observer tracking autostereoscopic display including a system 
as claimed in any one of the preceding claims. 

28. A display as claimed in Claim 27, in which the first means 
comprises a video camera disposed at an optical centre of the display. 

29. An image tracking method for sequentially presented images, 
comprising determining the position of a target image in a previously 
presented image, determining movement of the target image between the 
previously presented image and a subsequently presented image, and 
indicating the position of the target image in the subsequently presented 
image as the position in the previously presented image modified by the 
determined movement.